Abstract


This  paper  introduces  an  objective  shape-identification  task  for  measuring  the  kinetic  depth 
effect.  The  observer  views  an  array  of  many  randomly  positioned  dots  that  move  from  frame  to 
frame.  The  dot  motions  define  a  3D  shape  consisting  of  bumps  and  depressions  on  an  otherwise 
flat  ground.  On  each  trial,  a  presented  shape  is  chosen  from  a  large  lexicon  of  shapes  that  vary  in 
size,  position,  and  number  of  bumps.  The  observer’s  task  is  to  identify  the  shape  and  its  overall 
direction  of  rotation.  Identification  accuracy  in  the  3D  shape  identification  task  is  an  objective 
measure,  with  a  low  guessing  base  rate,  of  the  observer’s  perceptual  ability  to  reconstruct  a  global 
2D  motion  flow  field.  (1)  Objective  accuracy  data  are  shown  to  be  generally  consistent  with 
previously  obtained  subjective  rating  judgments  of  depth  and  coherence.  (2)  Along  with  motion 
cues,  the  rotation  of  real  3D  dot-defined  shapes  inevitably  produces  a  cue  of  changing  dot  density. 
By using a dot-lifetime manipulation to control dot density in our computer-generated shapes, we
show that changing density is neither necessary nor sufficient to account for observers’
performance; i.e., motion is sufficient for the KDE. (3) The extraction of motion cues from six
optimally  relevant  locations  would  support  perfect  KDE  performance  with  our  stimuli.  A 
simplified  2D  motion  identification  task  with  6  perceptually  flat  flow-fields  was  derived  from  the 
3D KDE task. Subjects’ performance in the 2D and 3D tasks is equivalent. This indicates (4)
that  the  information  processing  capacity  in  KDE  is  comparable  to  information  processing  in  other 
domains.  (5)  The  core  of  the  structure  from  motion  algorithm  that  is  proposed  to  underlie  the 
current  task  consists  of  locating  relative  minima  and  maxima  of  local  velocity  and  assigning  3D 
depths  proportional  to  velocity  at  these  locations. 
Introduction

In  1953,  Wallach  and  O’Connell  described  a  depth  percept  derived  from  motion  cues  that 
they  called  the  ‘Kinetic  Depth  Effect’  (KDE).  Since  that  time,  there  has  been  a  great  deal  of 
research  on  the  KDE,  examining  the  effects  of  stimulus  parameters  such  as  dot  numerosity  in 
multi-dot  displays  (Green,  1961;  Braunstein,  1962),  frame  timing  (Petersik,  1980),  occlusion 
(Andersen & Braunstein, 1983; Proffitt, Bertenthal, & Roberts, 1984), the detection of
non-rigidity in the three-dimensional form most consistent with the stimulus (Todd, 1982), and
veridicality  of  the  percept  (Todd,  1984a, b). 

Since  1979,  there  have  been  numerous  attempts  at  modeling  how  observers  and  machines 
could  derive  three-dimensional  structure  from  two-dimensional  motion  cues.  Ullman  (1979) 
referred to this computational task as the ‘Structure from Motion’ problem. Ironically, Ullman’s
model  and  most  ensuing  ones  do  not  explicitly  use  motion  cues.  These  models  are  essentially 
geometry  theorems  concerning  the  minimal  number  of  points  and  views  needed  to  specify  the 
shape  under  various  simplifying  constraints  such  as  assumed  object  rigidity  and  assumed  parallel 
perspective  (Ullman,  1979;  Webb  &  Aggarwal,  1981;  Hoffman  &  Flinchbaugh,  1982;  Hoffman 
&  Bennett,  1985;  Bennett  &  Hoffman,  1985).  From  the  geometric  models,  iterative  models 
developed that use newly arrived position data not to derive the true structure but to improve the
current three-dimensional representation in the sense of maximizing its rigidity (Ullman, 1984;
Landy, 1987). Only a few models actually use point velocity (i.e., an optic flow field) in addition
to point position (e.g., Clocksin, 1980; Longuet-Higgins & Prazdny, 1980; Koenderink & van
Doorn, 1986), and one model also uses point acceleration (Hoffman, 1982).

It has been difficult to relate models of the KDE to the results of psychological studies. An
important  component  of  the  problem  has  been  the  difficulty  of  finding  an  appropriate 
experimental  paradigm.  Many  KDE  experiments  have  used  subjective  ratings  of  ‘depth’  or 
‘rigidity’ or ‘coherence’ as the responses (see Dosher, Landy, and Sperling, 1987 for a review).
Relating  subjective  responses  to  a  process  model  of  KDE  is  problematic.  Typically,  a  structure- 
from-motion  model  yields  a  shape  specification.  To  link  the  derived  shape  to  subjective 
judgments,  and  thereby  to  experimental  results,  a  decision-making  apparatus  to  predict  judgments 
is  needed,  and  this  may  be  quite  complex. 

Objective  Measurements  of  KDE:  The  Problems 

Since  the  ability  to  derive  structure  from  motion  presumably  evolved  to  solve  an  objective 
environmental  problem,  a  better  approach  to  studying  KDE  is  to  measure  the  accuracy  of  the 
KDE  in  an  objective  fashion.  Does  the  observer  perceive  the  correct  shape  in  a  display?  The 
correct  depths?  The  correct  depth  order?  The  correct  curvature?  Some  of  the  studies  cited  above 
have  attempted  to  answer  such  questions  by  using  objective  response  criteria  (e.g.,  percent  correct 
in  a  one-  or  two-interval  forced-choice  task).  Unfortunately,  in  almost  every  case,  subjects  can 
achieve  good  performance  on  the  task  by  neglecting  perceived  depth  and  consciously  or 
unconsciously  formulating  their  responses  on  the  basis  of  other  cues.  In  these  cases,  there  is  a 
simple  non-KDE  cue  sufficient  to  make  the  judgment  accurately.  Although  the  subject  may  not 
consciously  be  using  these  artifactual  cues  to  make  correct  judgments,  we  cannot  be  sure  of  the 
basis  of  the  response  until  the  artifactual  cues  have  been  eliminated  or  rendered  useless  (e.g., 
through  irrelevant  variation). 

Let us consider some examples. Lappin, Doner, & Kottas (1980) presented subjects with a
two-frame  representation  of  dots  randomly  positioned  on  the  surface  of  an  opaque  rotating  sphere 
displayed  by  polar  projection.  On  the  second  frame,  a  small  percentage  of  the  dots  were  deleted 
and  replaced  with  new  random  dots.  Subjects  were  required  to  determine  which  of  two  such 
two-frame  displays  had  a  higher  signal-to-noise  ratio  (in  terms  of  dot  correspondences).  The 
authors interpret their results in terms of the “minimal conditions for the visual detection of
structure and motion in three dimensions,” which is the title of their article. Indeed, the signal
dots  represent  two  frames  of  a  rigid  rotating  sphere.  But,  subjects  do  not  need  to  correctly 
perceive  a  3D  sphere  in  order  to  make  a  correct  response.  There  is  no  analysis  offered  of  how  far 
a  3D  perception  could  diverge  from  spherical  and  still  yield  the  observed  accuracy  of  response. 
Alternatively,  subjects  might  base  their  responses  on  perceived  2D  flow  fields,  judging  the 
percentage  of  dots  in  the  first  frame  that  have  corresponding  dots  in  the  second  frame.  This  2D 
judgment  need  not  utilize  the  entire  motion  flowfield.  For  example,  the  5.6  degree  3D  motion  of 
the  sphere  corresponds  to  a  small,  essentially  linear  translation  in  the  center  of  the  field. 
Discriminating  signal-to-noise  ratios  in  translations  is  related  to  Braddick’s  (1974)  ‘dmax’ 
procedures  for  discriminating  perceived  linear  motion;  it  does  not  necessarily  have  anything  to  do 
with  KDE.  Thus,  although  the  authors  use  response  accuracy  as  their  dependent  variable,  the 
subject’s  ability  to  estimate  a  signal-to-noise  ratio  may  be  artifactual,  and  certainly  is  not  easily 
converted  into  an  estimate  of  the  accuracy  of  KDE. 

Petersik  (1979,  1980)  represented  rotating  spheres  by  surface  elements  that  were  dots  or 
small  vectors.  In  both  studies,  the  spheres  were  displayed  with  polar  projection,  and  subjects 
were  required  to  discriminate  clockwise  from  counterclockwise  rotation.  A  possible  artifact  here 
is that the motion of a single stimulus element provides sufficient information to respond
correctly.  That  is,  under  polar  perspective,  stimulus  points  follow  elliptical  paths  in  the  image 
plane.  To  determine  rotation  direction,  the  subject  needs  only  determine  the  2D  rotation  direction 
of  a  single  point  (assuming  knowledge  of  the  vertical  position  of  the  point  with  respect  to  eye 
level).  Petersik  made  the  task  more  difficult  by  adding  noise  to  some  dot  paths,  by  varying  the 
slant  of  vector  elements  from  frame  to  frame,  or  by  varying  the  numerosity.  However,  none  of 
these  manipulations  prevents  the  subject  from  using  a  purely  2-dimensional,  non-KDE  strategy. 
Indeed,  Braunstein  (1977)  had  previously  examined  precisely  this  point.  Braunstein 
demonstrated  that  only  the  vertical  component  of  the  polar  perspective  transformation  was  used 
by  subjects  for  a  depth-order  judgment,  and  that  this  component  was  sufficient. 

Andersen  &  Braunstein  (1983)  also  used  discrimination  of  rotation  direction  to  evaluate 
KDE.  Their  displays  represented  clumps  of  dots  on  the  surface  of  a  sphere.  A  clump  was 
construed  as  being  bounded  by  an  invisible  pentagon,  whose  presence  was  made  known  by  the 
fact  that,  when  it  lay  on  the  front  surface  of  the  sphere,  it  occluded  dots  which  lay  behind  it  on  the 
rear  surface.  These  spheres  were  displayed  by  parallel  perspective,  and  the  cue  to  depth  order 
(front,  rear)  was  provided  by  occlusion.  Again,  although  the  dependent  variable  is  response 
accuracy,  a  subject  does  not  need  to  perceive  a  3D  object  to  determine  the  direction  of  rotation — 
the  subject  needs  only  to  determine  the  movement  direction  of  the  continuously  visible  clumps. 

In  several  studies,  simple  relative  velocity  cues  are  all  that  the  subject  needs  to  perform  the 
KDE  task.  Braunstein  &  Andersen  (1981)  displayed  a  multi-dot  representation  of  a  dihedral  edge 
that  moved  horizontally.  The  dots  were  displayed  using  polar  projection,  so  that  horizontal  point 
velocities  were  inversely  proportional  to  depth.  Thus,  the  display  contained  a  velocity  gradient 
that either increased or decreased from the mid-line of the display to the upper and lower edges of
the  display.  Subjects  judged  whether  a  given  display  represented  a  convex  or  concave  edge.  In 
this  task,  comparing  the  relative  velocity  of  points  in  the  center  and  at  the  top  edge  of  the  display 
is  all  that  is  necessary  to  perform  accurately  (the  location  with  the  greater  velocity  is  judged 
“forward”). 

In  experiments  by  Todd,  subjects  determined  which  of  five  curvatures  (Todd,  1984a)  or 
slants  (Todd,  1984b)  were  depicted  in  a  multi-dot  display.  Again,  Todd  described  the  task  in 
terms  of  the  perceived  3D  object,  but  accurate  performance  is  possible  by  comparing  the  relative 
velocities  of  points  in  just  two  areas  of  the  display. 

In  all  the  studies  cited  above,  the  subject  could  perform  the  required  KDE  task  by  using  a 
minimal  artifactual  cue.  One  possible  solution  to  the  problem  of  subjects  learning  to  use 
artifactual  cues  is  to  withhold  feedback.  The  assumption  is  that,  without  feedback,  the  subject 
will  use  only  perceived  3D  shape.  This  approach  has  been  used  extensively  by  Todd  (1982, 
1984a,  1984b).  Unfortunately,  withholding  feedback  does  not  mean  that  the  subject  cannot  use 
an  alternative  perceptual  or  decision  strategy  to  supplement  judgments  of  perceived  KDE  depth. 
One  strategy  that  subjects  often  adopt  without  feedback  is  to  adjust  their  responses  so  as  to 
respond  equally  (or  nearly  equally)  often  with  each  of  the  possible  responses.  For  example, 
Todd’s  (1984a)  procedure  is  vulnerable  to  this  artifact  of  strategy.  He  used  surface  dots  to 
represent  cylinders  with  five  different  curvatures.  On  a  given  trial,  subjects  judged  which  of  the 
five  curvatures  was  presented.  As  an  alternative  to  perceiving  KDE  depth,  a  subject  could  judge 
the  apparent  velocity  of  dots  in  the  center  of  the  display,  and  use  the  knowledge  of  the  velocities 
displayed  on  previous  trials  to  choose  a  curvature  category.  Indeed,  subjects  are  extremely  good 
at estimating the mean velocity and variations from it in a sequence of displays (McKee,
Silverman  &  Nakayama,  1986).  While  the  subjects’  use  of  a  trivial  strategy  that  estimates  just  a 
single  velocity  per  trial  may  not  explain  the  entirety  of  Todd’s  results,  it  predicts  the  nearly 
veridical character of subject responses and thereby accounts for most of the data.

Objective  Measurement  of  KDE:  A  Proposed  Solution 

The  KDE  is  a  perceptual  phenomenon  that  allows  subjects  to  perceive  the  relative  depth  of 
different  positions  in  visual  space,  and  hence  to  infer  the  shapes  of  objects  in  the  environment.  In 
all  of  the  experiments  we  have  discussed,  the  shapes  presented  were  very  simple  (spheres, 
cylinders,  and  planes),  and  hence  simple  response  strategies  would  have  been  effective.  None  of 
the  experiments  discussed  above  requires  the  subject  to  use  a  perceived  3D  shape  in  order  to 
perform  accurately.  In  all  of  the  studies  we  have  reviewed,  subjects  had  the  opportunity  to  use 
artifactual  cues.  None  of  these  experiments  presented  shapes  with  complexity  approaching  that 
seen  in  the  real  world  in  which  the  ability  to  compute  structure  from  motion  evolved. 

In this paper, we describe a new method for investigating the kinetic depth effect. Our aim
is  to  provide,  instead  of  the  demonstration  of  KDE  by  means  of  perceptual  reports  (what  subjects 
say  they  see),  a  test  of  perceptual  abilities  (what  complex  shape  properties  subjects  can  extract 
from  visual  flow  fields).  The  task  is  shape  identification,  where  on  each  trial,  one  of  a  large 
lexicon  of  shapes  is  presented.  Each  shape  consists  of  a  flat  ground  with  zero,  one,  or  two  bumps 
or  depressions.  The  bumps  and  depressions  vary  in  position,  two-dimensional  extent,  and 
orientation.  Because  of  the  way  the  lexicon  of  shapes  is  constructed,  good  performance  in  the 
shape  identification  task  requires  simultaneous  local  computation  of  velocity  in  many  positions  of 
the display and global coordination of the local information.

Experiment  1.  Dot  Numerosity  and  Bump  Heights 

To  demonstrate  the  shape  identification  method  and  to  investigate  its  limits,  we  replicate 
and  extend  one  of  the  classic  findings  in  multi-dot  KDE:  the  dependence  of  quality  ratings 
(usually  combined  coherence  and  rigidity,  or  ‘goodness’)  on  dot  numerosity  (Green,  1961; 
Braunstein,  1962;  Landy,  Dosher,  and  Sperling,  1985;  Dosher,  Landy,  and  Sperling,  1987). 
Quality  of  KDE  generally  is  found  to  increase  with  dot  numerosity.  Here  we  investigate  the 
effects  of  dot  numerosity  and  depth  extent  on  the  effectiveness  with  which  subjects  can  use  the 
KDE  to  identify  the  target  shape  from  among  its  many  close  competitors. 

Method 

Subjects.  Three  subjects  were  used  in  the  study.  Two  are  authors,  and  the  third  was  a 
graduate  student  naive  to  the  purposes  of  the  experiment.  Two  had  normal  or  corrected-to- 
normal  vision;  CFS  was  correctable  only  to  20:40. 

Displays.  The  shapes  used  in  the  experiment  were  three-dimensional  surfaces  consisting  of 
zero,  one,  or  two  bumps  or  concavities  on  an  otherwise  flat  ground.  Here  we  use  the  term  shape 
to  indicate  the  positions  of  these  bumps  and  concavities  on  the  flat  ground,  irrespective  of  other 
stimulus  parameters  which  were  varied,  including  bump  height,  number  of  dots  used  to  represent 
the shape, and rotation direction. The shapes were constructed as follows (see Fig. 1a). Within a
square area with sides of length s, a circle with diameter 0.9s was centered. All depth values
outside the circle were set to zero (i.e., in the object base plane, which in the initial display is the
same  as  the  image  plane).  For  each  of  three  positions  inside  the  circle  (located  at  the  vertices  of 
an  equilateral  triangle),  the  depth  was  specified  as  either  +h  (a  distance  h  in  front  of  the  object 
base  plane,  closer  to  the  observer),  0  (in  the  object  base  plane),  or  -h  (behind  the  object  base 
plane).  A  smooth  spline  was  constructed,  using  a  standard  cubic  spline  algorithm,  which  passed 
through  the  flat  surround  and  the  vertices  of  the  triangle.  For  a  given  set  of  vertices,  27  shapes 
were constructed in this way (see Fig. 1b for some examples).


Insert  Fig.  1  here. 


Two different sets of vertices were used to generate shapes. These were either at the
corners of a triangle pointing up (designated ‘u’) or of a triangle pointing down (designated ‘d’).
Shapes were denoted by indicating the trio of positions (u or d), and then specifying for each
position (in the order shown in Fig. 1a) whether that position was in front of the object base
plane (‘+’), in the plane (‘0’), or behind it (‘-’). For example, the shape denoted by ‘u+-0’
consists of a bump in the upper-central area of the display, a depression in the lower-left, and a
flat area in the lower-right (see Fig. 1b). Note that ‘u000’ and ‘d000’ both designate the same
shape: a flat square. Fifty-three distinct shapes can be generated in this manner.
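
The naming scheme can be checked by enumeration. The following is a minimal sketch, not the authors' stimulus-generation code; the function names `shape_names` and `canonical` are introduced here purely for illustration. It lists every designation formed from a vertex set and three depth signs and confirms that merging the two names of the flat square leaves 53 distinct shapes.

```python
from itertools import product

DEPTH_SIGNS = "+0-"   # in front of, in, or behind the object base plane
VERTEX_SETS = "ud"    # triangle pointing up or pointing down

def shape_names():
    """Return all 2 x 3^3 = 54 designations, e.g. 'u+-0'."""
    return [v + "".join(signs)
            for v in VERTEX_SETS
            for signs in product(DEPTH_SIGNS, repeat=3)]

def canonical(name):
    """Map both names of the flat square ('u000', 'd000') to one label."""
    return "flat" if name[1:] == "000" else name

names = shape_names()
distinct = {canonical(n) for n in names}
print(len(names), len(distinct))  # 54 designations, 53 distinct shapes
```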

Displays  were  generated  for  all  combinations  of  the  53  shapes,  three  dot  numerosities,  and 
three bump heights. For the flat shape (denoted ‘u000’ or ‘d000’), varying bump height has no
effect, and so there are only three flat shape display types (corresponding to the three
numerosities). For all other shapes there are nine display types. This results in 471 display types.
For most display types, a single instantiation was generated (choosing a set of random dots, and
forming a display after rotation and projection). For each of the display types for the flat shape,
six instantiations were made. Thus, there were 486 different displays. Bump height, h, was 0.5s,
0.15s, or 0.05s, where s is the length of a side of the square ground. The 3D perspective drawings
of the shapes in Fig. 1b are for the largest bump heights. Dot numerosities were 20, 80, and 320.
The bump height and dot numerosity manipulations are illustrated in Figs. 1c and 1d,
respectively. 
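
The display counts follow from simple bookkeeping, which the short sketch below reproduces. The constant and function names are illustrative assumptions; only the numbers come from the text.

```python
N_SHAPES = 53             # distinct shapes, including the flat square
N_NUMEROSITIES = 3        # 20, 80, and 320 dots
N_HEIGHTS = 3             # three bump heights
FLAT_INSTANTIATIONS = 6   # instantiations per flat display type

def count_displays():
    """Return (number of display types, number of displays)."""
    non_flat_types = (N_SHAPES - 1) * N_NUMEROSITIES * N_HEIGHTS  # 52 * 9
    flat_types = N_NUMEROSITIES           # height is irrelevant for the flat shape
    display_types = non_flat_types + flat_types
    displays = non_flat_types + flat_types * FLAT_INSTANTIATIONS
    return display_types, displays

print(count_displays())  # (471, 486)
```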

Multi-dot  displays  of  these  shapes  were  generated  by  choosing  a  random  sample  of 
positions  on  each  surface,  rotating  the  resulting  set  of  points  about  a  fixed  vertical  axis,  and 
projecting  them  onto  an  image  plane  via  parallel  projection.  The  3D  motion  was  a  single  cycle  of 
a  sinusoidal  rotation  about  a  fixed  vertical  axis  through  the  center  of  the  object  base  plane,  with 
amplitude of 25 deg and period of 30 frames. More specifically, the angle at which the base
plane was oriented with respect to the image plane was $\theta(m) = \pm 25 \sin(2\pi m/30)$ deg, where m is the
frame  number  within  the  30  frame  display. 
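
The display-generation steps (random sampling, rotation about a vertical axis, parallel projection) can be sketched as follows. This is an illustration under stated assumptions, not the original stimulus software: the angle is taken to be ±25 sin(2πm/30) deg, as implied by the stated amplitude and period; `toy_bump` is a hypothetical Gaussian surface standing in for the cubic-spline shapes; and which sign corresponds to which named rotation direction is left open.

```python
import math
import random

AMPLITUDE_DEG = 25.0   # rotation amplitude given in the text
PERIOD_FRAMES = 30     # one sinusoidal cycle spans 30 frames

def theta_deg(m, sign=1):
    """Base-plane angle (deg) at frame m; sign (+1 or -1) selects the direction."""
    return sign * AMPLITUDE_DEG * math.sin(2 * math.pi * m / PERIOD_FRAMES)

def sample_surface(n_dots, depth, s=1.0, seed=0):
    """Draw n_dots random (x, y) positions in the s-by-s square, with z = depth(x, y)."""
    rng = random.Random(seed)
    points = []
    for _ in range(n_dots):
        x = rng.uniform(-s / 2, s / 2)
        y = rng.uniform(-s / 2, s / 2)
        points.append((x, y, depth(x, y)))
    return points

def project(points, angle_deg):
    """Rotate about the vertical axis through the origin, then drop z (parallel projection)."""
    a = math.radians(angle_deg)
    return [(x * math.cos(a) + z * math.sin(a), y) for x, y, z in points]

def make_frames(points, sign=1):
    """Return the 30 projected frames of one rotation cycle."""
    return [project(points, theta_deg(m, sign)) for m in range(PERIOD_FRAMES)]

def toy_bump(x, y, h=0.5, width=0.2):
    """Hypothetical Gaussian bump; the paper's shapes use a cubic spline instead."""
    return h * math.exp(-(x * x + y * y) / (2 * width * width))

frames = make_frames(sample_surface(320, toy_bump))
```

Because the projection discards z, vertical image positions never change; all of the shape information is carried by the horizontal displacements, which scale with depth.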

Two rotation directions were used, indicated as ‘l’ and ‘r’, corresponding to whether the left
or right edge of the display comes forward initially. Equivalently, this describes the side of the
observer to which the shape ‘faces’ in the second half of the rotation (which is usually an easier
way to code the response). For an ‘l’ rotation (see Fig. 1e), the object initially appeared face-
forward. It was then rotated so that the front moved to the right until the object had rotated 25
deg. Then it reversed direction and rotated to the left until it was 25 deg to the left of its initial
orientation. Finally, it again reversed direction and rotated until the ground plane was again
perpendicular to the line of sight. A full description of a display by a subject includes the
indication of the set of vertices (u or d), the 3D depths at these vertices (+, -, 0), and the
direction of rotation (l or r), for example, ‘u+-0l’.

Because  of  the  parallel  projection,  simultaneous  reversal  of  depth  signs  and  of  rotation 
direction  yields  precisely  the  same  physical  image  sequence.  The  486  displays  described  above 
were all generated with the ‘l’ rotation, but each can equally well be described as an ‘r’ rotation
of the sign-reversed shape. There are 108 ways to designate a display by combining an up or
down shape-type with a bump, depression, or flat surface at three different locations with a left or
right initial direction of motion; that is, {d,u} × {+,-,0}³ × {l,r}. For most shapes, there are two
equally valid ways to describe the display. For example, ‘u+-0l’ and ‘u-+0r’ describe the same
display. The flat shape is denoted equally accurately as u000l, u000r, d000l, and d000r. Given
the four instantiations of the flat shape, chance performance depends on subject strategy.
Repeated responses of u000l (and its equivalents) yield a guaranteed performance of 18 in 486
correct  (or  2  in  54).  Random  guessing  yields  an  expected  performance  of  just  over  1  in  54 
correct.  Subjects  did  not  designate  bump  height  in  their  responses.  Except  in  the  case  of  the  flat 
stimuli,  bump  height  was  obvious. 
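
The two chance rates quoted above can be reproduced with exact fractions. The sketch below assumes that "random guessing" means choosing uniformly among the 108 designations; the function names are illustrative.

```python
from fractions import Fraction

N_DESIGNATIONS = 2 * 3**3 * 2   # {d,u} x {+,-,0}^3 x {l,r} = 108
N_NONFLAT = 468                 # displays with two valid designations
N_FLAT = 18                     # flat displays with four valid designations
N_DISPLAYS = N_NONFLAT + N_FLAT

def always_flat():
    """Proportion correct when every response is a flat designation."""
    return Fraction(N_FLAT, N_DISPLAYS)

def random_guessing():
    """Expected proportion correct for uniform guessing over all designations."""
    expected_correct = (N_NONFLAT * Fraction(2, N_DESIGNATIONS)
                        + N_FLAT * Fraction(4, N_DESIGNATIONS))
    return expected_correct / N_DISPLAYS

print(always_flat())      # 1/27, i.e. 2 in 54
print(random_guessing())  # 14/729, just over 1 in 54
```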

After  sampling,  rotation,  and  projection,  any  given  frame  of  the  display  consisted  of  n 
points  in  the  image  plane.  These  points  were  displayed  as  bright  dots  on  a  dark  background.  The 
square  image  extent  of  the  displays  projected  to  a  182  x  182  pixel  area  subtending  4.0  deg  of 
visual angle. The displays were not windowed in any way, so the edges of the display oscillated
in and out with the rotation. With the 25 deg wiggle, at the instants when rotation reverses, the
display has shrunk to 90% of its initial horizontal extent.
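
The 90% figure is what parallel projection predicts for a flat ground rotated by 25 deg about a vertical axis, since the projected width scales with the cosine of the rotation angle. A small check, with illustrative function names:

```python
import math

def horizontal_extent_ratio(angle_deg):
    """Projected width of the flat ground relative to its face-on width."""
    return math.cos(math.radians(angle_deg))

def projected_width_px(width_px, angle_deg):
    """Projected width in pixels for a ground that is width_px wide face-on."""
    return width_px * horizontal_extent_ratio(angle_deg)

print(round(horizontal_extent_ratio(25.0), 3))  # 0.906
```

For the 182-pixel display this corresponds to a width of roughly 165 pixels at the rotation extremes.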

Displays were presented on a background that was uniformly dark (approximately .001
cd/m²). Dots were single pixels of approximately 65 μcd and were viewed from a distance of 1.6
m.  A  trial  sequence  consisted  of  a  cue/fixation  spot  presented  for  1  sec,  a  1  sec  blank  interval, 
and  the  2-sec  stimulus  sequence.  The  stimulus  sequence  was  followed  by  a  blank  screen,  the 
luminance  of  which  was  the  same  as  the  background  of  the  stimulus.  The  display  was  run  at  60 
Hz  noninterlaced.  Each  display  frame  was  repeated  four  times,  for  an  effective  rate  of  15  new 
frames  per  second.  The  duration  of  each  30-frame  display  was  2  sec. 

Apparatus.  Stimuli  were  computed  in  advance  of  the  session  and  stored  on  disk.  The 
stimuli  were  processed  for  display  by  an  Adage  RDS-3000  image  display  system  and  were 
displayed  on  a  Conrac  721109  RGB  color  monitor.  The  stimuli  appeared  as  white  dots  on  a 
black  background. 

Viewing  Conditions.  Stimuli  were  viewed  monocularly  (with  the  dominant  eye)  through  a 
black  cloth  viewing  tunnel.  In  order  to  minimize  absolute  distance  cues,  a  circular  aperture 
slightly  larger  than  the  square  display  area  restricted  the  field  of  view.  Stimuli  were  viewed  from 
a  distance  of  1.6  m.  After  each  stimulus  presentation,  the  subject  typed  a  response  on  a  computer 
terminal. Room illumination was dim (illuminance was approximately 8 cd/m²).

Procedure. Subjects were shown perspective drawings of the shapes (as in Fig. 1b), and
were  instructed  as  to  how  they  were  constructed  and  named.  They  were  told  that  they  would  be 
shown  multi-dot  versions  of  these  shapes,  and  would  be  required  to  name  the  shape  displayed  and 
its rotation direction as accurately as possible. They were told to use any method they chose to
remember  and  apply  the  shape  and  rotation  designations. 

Each  of  the  486  displays  was  viewed  once  by  each  subject.  The  displays  were  presented  in 
a  mixed-list  design  in  four  sessions  of  45  min.  After  each  response,  the  possible  correct  responses 
were  listed  as  feedback.  For  each  stimulus,  there  were  always  two  responses  that  were  scored  as 
correct (given perceptual reversals). For the flat stimuli, four possible answers were correct.

To  become  familiar  with  the  task  and  the  method  of  response,  each  subject  ran  trials 
consisting  of  27  of  the  easiest  stimuli  (the  320  dot  0.5s  height  stimuli).  Subjects  ran  until 
accuracy  was  at  least  85%  correct  (approximately  100  to  130  trials). 

Results 

Accuracy  data.  All  subjects  reported  that  they  perceived  a  3D  surface  the  first  and  every 
subsequent  time  they  viewed  the  high  numerosity  displays.  With  low  numerosities,  the  dots  were 
perceived  in  approximately  their  correct  positions  in  3D  space,  but  there  were  too  few  dots  to 
give  the  illusion  of  a  continuous  surface  or  to  discriminate  unambiguously  between  alternative 
responses.  The  very  limited  practice  served  merely  to  teach  the  subjects  to  name  the  perceived 
shapes  without  having  to  refer  to  drawings. 

The  results  of  Experiment  1  are  summarized  in  Fig.  2.  Each  response  was  scored  as  correct 
only  if  both  the  shape  and  the  rotation  direction  were  correct  and  consistent.  Thus,  if  'u+-0l' 
was the display, responses u+-0l and u-+0r were correct. Every other response was incorrect.
There  were  occasional  responses  with  the  correct  shape  and  the  incorrect  rotation  direction  (66 
such errors, 4.5% of all responses, 10% of all errors). Subjects later indicated that most of these
were  a  result  of  forgetting  the  direction  of  rotation  before  the  response  was  completed,  rather  than 
from  a  truly  mis-rotating  percept.  Nevertheless,  such  responses  were  treated  as  incorrect. 
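
The scoring rule can be written out explicitly. The sketch below is an illustration of the rule as described, with hypothetical helper names; it treats a response as correct if it matches the display's designation or its depth-reversed, rotation-reversed equivalent, and accepts all four flat designations for a flat display.

```python
FLIP_SIGN = {"+": "-", "-": "+", "0": "0"}
FLIP_ROTATION = {"l": "r", "r": "l"}

def equivalent(designation):
    """Reverse all depth signs and the rotation direction: 'u+-0l' -> 'u-+0r'."""
    vertex_set, signs, rotation = designation[0], designation[1:4], designation[4]
    return vertex_set + "".join(FLIP_SIGN[s] for s in signs) + FLIP_ROTATION[rotation]

def correct_responses(display):
    """Set of responses scored as correct for a given display designation."""
    if display[1:4] == "000":
        # The flat square has four equally valid designations.
        return {v + "000" + r for v in "ud" for r in "lr"}
    return {display, equivalent(display)}

def is_correct(display, response):
    return response in correct_responses(display)

print(is_correct("u+-0l", "u-+0r"))  # True: consistent depth and rotation reversal
print(is_correct("u+-0l", "u+-0r"))  # False: correct shape, wrong rotation
```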


Insert  Fig.  2  here. 


As  expected,  accuracy  improved  both  with  the  numerosity  and  with  the  amount  of  depth 
displayed.  There  were  signs  of  a  ceiling  in  performance  as  numerosity  increased.  For  two 
subjects,  for  320  point  displays,  the  curves  crossed,  and  the  middle-range  depth  extent  (0.15s) 
was as good as or better than the large 0.5s depth extent. An ANOVA was computed treating
numerosity,  height,  and  subjects  as  treatments,  and  shapes/rotations  as  the  experimental  units. 
Both  numerosity  and  degree  of  depth  are  highly  significant  (p  <  .0001).  Subjects  differ 
significantly  from  one  another  (p  <  .0001).  The  three-way  interaction  was  significant  (p  <  .01), 
indicating  that  the  interaction  of  height  and  number  differed  among  subjects  (see  Fig.  2).  No 
two-way  interactions  were  significant. 

Error  analyses.  A  confusion  matrix  was  computed,  pooled  across  subjects,  the  nine 
conditions,  two  rotation  directions,  and  two  possible  designations  of  each  shape  or  depth 
reversals  (it  was  thus  a  27  x  27  =  729  cell  matrix).  Table  1  is  a  summary  of  these  identification 
errors.  Descriptions  are  given  for  seven  common  error  types,  one  uncommon  error  type,  and  a 
miscellaneous  category.  If  a  bump  and  a  depression  were  present  in  the  display,  and  only  one  of 
the  two  was  indicated  by  the  subject,  this  was  called  a  ‘Missed  Feature  Error’.  If  the  bump  and 
depression are of equal extent on the base plane (e.g., ‘u+-0’), then this is called a ‘Missed Equal
Size  Feature’.  If  they  are  of  unequal  extent,  and  the  smaller  of  the  two  is  not  reported,  this  is 
categorized  as  a  ‘Missed  Smaller  Feature’.  Any  display  containing  only  one  depth  sign  (such  as 
‘u+00’) reported as containing both depth signs (e.g., ‘u0+-’) is categorized as ‘Report Two
Depth  Signs  When  There  Was  Only  One’.  For  a  given  row  in  the  table,  the  second  column 
presents  examples  of  errors  of  that  type.  The  third  column  lists  the  number  of  cells  in  the 
confusion  matrix  which  correspond  to  an  error  of  a  given  type,  while  the  fourth  column  provides 
the  total  number  of  errors  that  occurred  over  all  cells  of  that  type.  The  last  column  is  the  average 
number  of  errors  per  cell  in  cells  of  that  type,  computed  as  the  ratio  of  the  number  of  trials 
indicated  in  column  four  divided  by  the  number  of  cells  in  column  three.  In  total,  there  were  586 
errors;  divided  by  702  error  cells  this  yields  0.83  errors  per  cell  on  the  average.  A  ratio  greater 
than 0.83 in column 5 of Table 1 indicates an error type more common than the average; a
smaller number indicates a less common than average error type.
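
The baseline of 0.83 errors per cell can be recomputed from the figures in the text. In this sketch the function names are illustrative; the off-diagonal cells of the 27 x 27 matrix are taken as the error cells, which gives the 702 quoted above.

```python
MATRIX_SIZE = 27     # rows and columns of the pooled confusion matrix
TOTAL_ERRORS = 586   # total identification errors reported in the text

def error_cells(n=MATRIX_SIZE):
    """Off-diagonal cells: every cell except the n correct-response cells."""
    return n * n - n

def errors_per_cell(errors, cells):
    """Average number of errors per cell, as in the last column of Table 1."""
    return errors / cells

baseline = errors_per_cell(TOTAL_ERRORS, error_cells())
print(error_cells(), round(baseline, 2))  # 702 0.83
```

The same `errors_per_cell` ratio, applied to the error and cell counts of any single row of Table 1, can be compared against this baseline.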


Insert  Table  1  here. 


The  bottom  row  of  the  table  provides  summary  information.  The  first  seven  error  types 
listed  had  ratios  well  over  this  value,  and  thus  were  more  common  than  other  errors.  The  ‘Report 
Two  Depth  Signs  ...’  error  type  is  an  example  of  an  exceedingly  uncommon  error. 

An  insufficient  quantity  of  data  was  collected  to  enable  us  confidently  to  draw  many 
specific  conclusions  from  the  error  data.  The  hypothesis  that  errors  are  distributed  uniformly 
across the nine error classes is easily rejected (χ² = 1032, df = 8, p < .001). It appears that four
types of errors were the most prevalent. Large single bumps were highly confusible, especially
the subtle difference in shape that distinguishes ‘d+++’ from ‘u+++’, but also that between
‘d+++’ and ‘d0++’, etc. Errors were made in horizontal location of the shape within the ground
(e.g. ‘d0+0’ was reported as being ‘u+00’, or ‘d++0’ as ‘u+0+’). Errors were also made in
judging the width of the bumps (e.g. ‘d+00’ reported as ‘u0++’). Finally, where both a bump
and  a  concavity  were  present,  occasionally  one  of  the  two  was  not  noticed.  It  is  interesting  that  in 
every case of this type of error (the ‘Missed Smaller Feature’ and ‘Missed Equal Size Feature’ errors
of  Table  1,  and  the  less  common  missed  larger  features),  the  response  was  of  a  single  bump 
toward  the  observer.  In  other  words,  in  the  presence  of  a  perceived  convexity,  a  concavity  is 
occasionally  missed,  but  not  the  other  way  around.  On  the  other  hand,  when  only  one  nonzero 
depth  was  present  (a  single  bump  or  concavity),  it  was  very  rare  for  subjects  to  give  a  response 
containing  multiple  depth  signs. 

When  the  confusion  matrix  is  broken  down  by  experimental  condition,  the  amount  of  data 
is  rather  low.  Nevertheless,  a  few  interesting  trends  are  evident.  First,  all  seven  common  error 
types  (the  first  seven  rows  of  Table  1),  remain  common  in  all  experimental  conditions.  As  the 
task  becomes  more  difficult,  the  types  of  errors  subjects  make  remain  ‘sensible’.  Second,  the  first 
two error types, although common in difficult conditions (low height or low numerosity), become
even  more  common  in  easier  conditions.  As  the  shape  impression  improves,  the  subjects  are  able 
to  eliminate  other  possible  shapes  and  then  are  more  likely  to  err  by  choosing  the  most  similar 
incorrect shape. The distinction between ‘d+++’ and ‘u+++’ is very difficult even when the
perception  of  depth  is  quite  compelling  and  well-sampled.  The  ‘Report  Two  Depth  Signs...’  error 
type is uncommon in all conditions, but there appears to be a trend for this error type to become
more  common  as  numerosity  increases. 

Experiment  2.  Texture  Density. 

Several  cues  may  lead  to  correct  shape  identification  in  the  KDE  task.  One  cue  is  dynamic 
changes  in  texture  density.  The  shapes  are  generated  in  such  a  manner  that,  face-on,  the  expected 
local  dot  density  across  the  display  is  uniform.  By  itself,  the  initial  frame  has  no  shape 
information whatsoever. As the shape rotates, areas in the display become more dense or sparse
as the areas in the shape that they portray become more or less slanted relative to the observer.
Theoretically, the observer could use this cue from the frames after the first to determine
the shape. Since we are interested in structure from motion, the changing texture density is an
artifactual cue that accompanies the relative motion cue. In Experiment 2, we compare three conditions: (1)
Both  the  motion  and  density  cues  are  present  as  before;  (2)  Only  the  motion  cue  is  present  —  dot 
lifetimes  are  varied  in  such  a  way  as  to  eliminate  the  density  cue  by  keeping  local  average  dot 
density  constant  across  the  display;  and  (3)  Only  the  density  cue  is  present  —  the  relative  motion 
cue  is  eliminated  by  reducing  dot  lifetimes  to  just  one  frame. 

Method 

Subjects.  Three  subjects  were  used  in  the  study.  One  was  an  author,  two  were  graduate 
students naive to the purposes of the experiment. Two had corrected-to-normal vision; CFS was
correctable  only  to  20:40. 


Sperling,  Landy,  Dosher,  &  Perkins 


Identifying  Shape  by  KDE 


Page  19 


Displays.  The  displays  were  generated  in  a  manner  similar  to  Expt.  1.  The  same  lexicon  of 
53  shapes  was  used.  The  flat  ground  surrounding  each  shape  was  extended  horizontally  by  20%, 
and  was  later  windowed  to  the  same  182  x  182  pixel,  4  deg  square,  so  that  the  sides  of  the 
displays  no  longer  oscillated  with  the  rotation.  Instead,  points  appeared  and  disappeared  at  the 
edges  of  the  window.  For  each  shape,  an  instantiation  of  the  shape  was  made  with  10,000  points, 
and with the large 0.5s bump height of Expt. 1. Displays for each of the three experimental
conditions  were  made  by  randomly  subsampling  points  from  this  rotating  10,000-dot  shape. 

(1) Control condition: Motion and texture cues. The control condition has both the relative
motion  and  changing  texture  density  cues.  A  small  random  subsample  of  points  is  chosen,  so  that 
approximately  320  points  are  visible  through  the  4  deg  square  window.  The  subsample  of  points 
is  rotated  and  projected  as  before,  and  then  clipped  so  as  to  display  only  those  points  within  the 
window.  This  condition  is  identical  to  the  easiest  condition  of  Expt.  1  (0.5s,  320  dots)  except  for 
the  windowing  (and  the  lower  dot  contrast  described  below).  Examples  of  the  density  cue 
available  in  these  displays  are  shown  in  Fig.  3. 


Insert  Fig.  3  here. 


(2)  Only  motion  cue.  This  main  experimental  condition  removes  the  changing  texture 
density  cue  (Fig.  3).  The  4  deg  x  4  deg  square  window  is  treated  as  consisting  of  a  10  x  10  grid 
of  subsquares.  Texture  density  is  kept  uniform  by  forcing  each  subsquare  to  contain  exactly  3 
points  in  every  display  frame.  Thus,  there  are  exactly  300  points  visible  in  every  frame.  On  the 
first frame, 300 of the 10,000 points are randomly chosen, subject to the constraint that exactly 3
points  are  chosen  in  each  subsquare.  On  each  subsequent  frame,  the  10,000  points  are  rotated  by 
the  proper  amount.  Then,  for  each  of  the  100  subsquares,  the  points  (of  the  300)  which  now 
appear  in  each  subsquare  are  counted.  If  more  than  three  would  occur,  points  are  randomly 
chosen  and  marked  as  no  longer  displayed,  until  the  number  of  displayed  points  in  that  subsquare 
falls to three. If fewer than three points in a grid square would be displayed, then more points are
randomly  chosen  (from  the  10,000)  that  currently  would  appear  in  that  subsquare  to  bring  the 
total  back  up  to  three.  In  this  condition,  dot  density  remains  uniform  throughout  the  display. 
Points  are  deleted  or  reinstated  only  as  needed  to  keep  the  density  uniform.  Although  variations 
in  texture  density  are  noticeable  in  the  control  displays,  the  exclusion  of  the  density  cue  does  not 
seriously  disrupt  the  correspondence  of  the  majority  of  the  points:  most  points  remain  displayed 
for  10  frames  or  more  of  the  30  frame  display. 
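
The per-frame bookkeeping described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the original display program: the window is normalized to the unit square, the function name `equalize_density` is hypothetical, and the "rotation" in the demonstration is replaced by an arbitrary horizontal compression that merely changes local density.

```python
import random


def equalize_density(positions, displayed, grid=10, per_cell=3, rng=random):
    """Return the indices to display on this frame.

    positions: (x, y) of ALL points after rotation and projection, in window
               units where the window is [0, 1) x [0, 1).
    displayed: set of indices that were displayed on the previous frame
               (an empty set on the first frame).
    """
    cells = {}  # grid cell -> indices of every point currently inside it
    for i, (x, y) in enumerate(positions):
        if 0.0 <= x < 1.0 and 0.0 <= y < 1.0:
            cell = (min(grid - 1, int(x * grid)), min(grid - 1, int(y * grid)))
            cells.setdefault(cell, []).append(i)

    shown = set()
    for members in cells.values():
        keep = [i for i in members if i in displayed]
        if len(keep) > per_cell:
            # Too many displayed points: randomly drop the excess.
            keep = rng.sample(keep, per_cell)
        elif len(keep) < per_cell:
            # Too few: recruit undisplayed points that now fall in this cell.
            spare = [i for i in members if i not in displayed]
            keep += rng.sample(spare, min(per_cell - len(keep), len(spare)))
        shown.update(keep)
    return shown


rng = random.Random(0)
points = [(rng.random(), rng.random()) for _ in range(10_000)]
frame0 = equalize_density(points, set(), rng=rng)
# Hypothetical stand-in for rotation: compress x toward the left edge.
moved = [(x ** 1.2, y) for x, y in points]
frame1 = equalize_density(moved, frame0, rng=rng)
print(len(frame0), len(frame1), len(frame0 & frame1))
```

Because points are dropped or recruited only when a subsquare's count departs from three, most indices in `frame0` should survive into `frame1`, which is the property that preserves dot correspondence.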

The amount of scintillation is small. The average change (one half of total dot additions
plus deletions) between two frames is 16; for 320-dot displays this is 5.0% scintillation. (The
highest  between-frame  scintillation  was  8.3%.) 

(3)  Only  texture  density  cue.  The  relative  motion  cue  is  removed  in  this  condition  leaving 
the  changing  texture-density  cue  intact.  For  each  frame  in  the  display,  320  of  the  10,000  points 
are  randomly  chosen.  This  happens  independently  on  every  single  frame,  subject  to  the 
constraint  that  no  point  ever  appears  in  two  successive  frames.  Thus,  no  relative  motion  cues  are 
available  in  these  displays,  which  look  like  dynamic  sparse  random  dot  noise.  On  the  other  hand, 
since  the  points  are  chosen  randomly  from  the  10,000  points,  they  have  the  same  expected  texture 
density  as  the  10,000  points  on  each  frame,  and  indeed  become  more  dense  and  sparse  in  exactly 
the same fashion as in the first experimental condition (as illustrated in Fig. 3).
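
The sampling constraint for the density-only condition can be illustrated with a short sketch. It assumes that points are identified by index and ignores rotation and windowing, which do not affect the constraint itself; the function name and the seed are illustrative.

```python
import random


def density_only_frames(n_points=10_000, n_shown=320, n_frames=30, seed=0):
    """Pick n_shown point indices per frame so that no index is shown on
    two successive frames (every dot has a one-frame lifetime)."""
    rng = random.Random(seed)
    frames, previous = [], set()
    for _ in range(n_frames):
        # Exclude the points displayed on the immediately preceding frame.
        candidates = [i for i in range(n_points) if i not in previous]
        current = set(rng.sample(candidates, n_shown))
        frames.append(current)
        previous = current
    return frames


frames = density_only_frames()
print(len(frames), len(frames[0]))
```

Since each frame is an independent random sample of the same 10,000 rotating points, its expected local density tracks that of the full point set even though no dot survives long enough to carry a motion signal.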

There  are  53  possible  shapes  and  three  experimental  conditions,  resulting  in  159  display 
types.  Two  different  displays  are  made  of  each  display  type  of  the  flat  shape,  and  one  display  for 
all  other  display  types.  There  were  thus  162  displays.  They  were  displayed  as  bright  green  dots 
on a green background of lesser luminance. The display background luminance was 31 cd/m².
Each dot added an additional 13 μcd, viewed from a distance of 1.6 m. All other display
characteristics  were  the  same  as  in  Expt.  1. 

Apparatus.  The  apparatus  was  the  same  as  in  Expt.  1.  Only  the  green  channel  of  the 
Conrac  display  monitor  was  used. 

Viewing  Conditions.  The  viewing  conditions  were  identical  to  Expt.  1. 

Procedure.  There  were  eleven  experimental  conditions:  the  three  described  above  (motion 
and  texture,  motion  only,  texture  only)  and  eight  others  which  will  be  reported  elsewhere.  There 
was  thus  a  total  of  594  displays,  including  the  162  displays  of  the  three  conditions  reported  here. 
These  were  presented  in  a  mixed-list  design  in  four  sessions  of  one  hour  each.  Otherwise,  the 
procedure  was  identical  to  Expt.  1. 
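
The display counts quoted in this section follow from simple bookkeeping, sketched below. The only assumption is the one stated in the text: each condition contains one display per shape plus one extra display of the flat shape. The names are illustrative.

```python
N_SHAPES = 53
EXTRA_FLAT = 1  # the flat shape gets a second display in every condition


def n_displays(n_conditions: int) -> int:
    """Total number of displays for a given number of conditions."""
    return n_conditions * (N_SHAPES + EXTRA_FLAT)


display_types = N_SHAPES * 3   # shapes x the three conditions reported here
reported = n_displays(3)       # displays in the three reported conditions
total = n_displays(11)         # displays across all eleven conditions
print(display_types, reported, total)
```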

Results 

Density  cue.  The  results  are  shown  in  Fig.  4.  For  two  subjects  (MSL  and  CFS), 
elimination  of  the  changing  density  cue  does  not  alter  performance.  For  the  third  subject  (JBL), 
performance  drops  from  81.5%  to  68.5%  when  the  density  cue  is  eliminated.  However,  it  is  not 
clear whether this small performance change is due to the elimination of the density cue itself or
the  introduction  of  scintillation  (dot  noncorrespondences)  by  the  process  of  eliminating  density 
cues.  For  two  subjects  (CFS  and  JBL),  the  elimination  of  the  relative  motion  cue  in  the  density 
only  condition  drops  performance  to  levels  that  do  not  differ  significantly  from  chance.  For  the 
third  subject  (MSL),  performance  with  the  density  cue  alone  is  significantly  above  chance, 
although  well  below  performance  for  conditions  in  which  the  relative  motion  cue  is  available. 


Insert  Fig.  4  here. 


In  the  condition  in  which  only  the  changing  dot  density  cue  is  available,  the  displays  do  not 
look  three-dimensional.  The  only  subject  (an  author)  who  was  able  to  perform  significantly 
above  chance  in  this  condition  was  highly  familiar  with  the  construction  of  the  displays.  For  any 
given  shape  and  rotation  direction,  clumps  of  higher  density  appear  first  on  one  side  of  the 
display,  and  then  later  on  the  other  side,  as  the  object  is  rotated  an  equal  amount  in  both 
directions  from  the  initial  face-forward  orientation  through  the  course  of  the  30-frame  display. 
Performance  was  a  matter  of  noting  the  positions  in  the  display  at  which  high  density  occurred, 
on  which  side  of  the  display  they  occurred  first,  and  the  2D  shape  of  the  texture  clump.  Then,  a 
response  was  chosen  which  was  most  consistent  with  this  information.  This  was  a  highly 
cognitive  task,  and  it  took  far  longer  to  respond  in  this  condition  as  a  result. 

Changing  dot  density  is  neither  a  necessary,  nor  a  sufficient  cue  for  the  perception  of  3D 
shape  with  these  displays.  However,  when  the  density  cue  was  available  with  motion  cues,  the 
density cue may have been used by one of three subjects to slightly improve his responses. When
the  density  cue  was  the  only  cue,  another  one  of  three  subjects  was  able  to  improve  his  response 
accuracy  to  significantly  above  chance.  These  results  point  out  the  importance  of  removing 
artifactual  cues  from  kinetic  depth  displays. 

Scintillation  cue.  In  the  constant  density  condition,  one  might  argue  that  the  subject  is 
indirectly  provided  with  shape  information  by  the  amount  of  scintillation  (dot 
noncorrespondence)  in  different  areas  of  the  display.  Local  scintillation  could  potentially  be  used 
by  a  subject  (just  as  density  information  was  useful  to  one  of  three  subjects  in  the  density-only 
condition). 

The  relation  between  local  scintillation  in  these  displays  and  local  density  (and  thereby, 
ultimately,  local  shape)  in  the  control  displays  is  not  simple.  Points  are  deleted  or  added  only 
when  necessary  to  keep  constant  the  number  of  points  in  a  given  locale.  The  number  of  points 
that will be added (or deleted) is thus proportional to the local rate of change of texture
density.  The  difficulty  in  computing  shape  from  scintillation  is  that  subjects  are  poor  at  judging 
the  degree  of  scintillation  in  a  pattern,  other  than  differentiating  some  scintillation  from  no 
scintillation (Lappin, Doner, and Kottas, 1980). And it is even more difficult to determine whether
scintillation  is  due  to  points  being  added  or  to  points  being  subtracted,  that  is,  to  determine  the 
sign  of  the  change  of  texture  density. 

We further investigated the possibility that scintillation might have been a useful cue in an
informal experiment. Various amounts of irrelevant scintillation (in the form of fresh, randomly
occurring dots in each frame) were added to all areas of each frame. With added scintillation that
was 10 times more than that produced by the density removal program, the quality of the image
was  greatly  impaired.  But  the  ability  to  discriminate  shapes  seemed  to  be  unimpaired.  This 
means  that  scintillation  is  relatively  unimportant:  Large  amounts  do  not  greatly  impair  the 
display; small amounts are not necessary to perceive KDE because, when they are masked by
large  amounts  of  scintillation,  performance  hardly  suffers. 

In  displays  similar  to  those  of  Experiment  2,  restricting  dots  to  have  lifetimes  of  only  3 
frames  is  another  operation  that  generates  large  amounts  of  scintillation.  KDE  identification 
performance  remains  high  even  though  the  amount  of  scintillation  is  large  and  varies  randomly 
throughout  the  display  and  from  frame  to  frame  (Dosher,  Landy,  and  Sperling,  1988;  Landy, 
Sperling,  Dosher,  and  Perkins,  1987).  All  in  all,  the  difficulty  subjects  have  in  estimating  the 
amount  of  scintillation  in  the  first  place  and  the  subsequent  difficulty  of  any  computation  for 
estimating  shape  from  scintillation  make  it  unlikely  that  scintillation  plays  a  significant  role.  We 
conclude  that  density-related  shape  cues  are  eliminated  in  the  motion-only  displays. 

Experiment  3.  Equivalent  2D  Task. 

Because  of  the  large  set  of  shapes,  the  systematic  way  in  which  it  was  constructed,  and  the 
large  set  of  possible  responses,  it  appears  difficult  to  perform  accurately  in  this  task  without  a 
global  perception  of  shape.  Indeed,  except  in  the  case  of  the  density-only  displays  of  Expt.  2,  all 
of  our  subjects  reported  perceiving  a  global  shape  and  basing  their  response  on  this  global  shape 
percept.  Nevertheless,  one  of  our  most  serious  objections  to  previous  studies  of  KDE  was  that 
the  subjects  could  have  performed  the  experimental  tasks  without  a  global  perception  of  shape  by 
using minimal, incidental cues. Because our set of shapes is finite (53 shapes), there are
indeed potential artifactual strategies; however, because each realization of a shape is composed
of  different  random  dots,  we  were  unable  to  discover  any  simple,  minimal  computation  for  our 
task.  The  simplest  computation  is  equivalent  to  what  we  believe  the  KDE  computation  itself  to 
be. 


To  study  alternative  mental  computations  that  might  yield  correct  responses  in  our  KDE 
task,  we  developed  a  new  display  that  did  not  produce  the  3D  depth  percept  of  KDE  but  that  was 
as  equivalent  as  possible  to  the  KDE  display  in  other  respects.  To  perform  correctly  with  the  new 
display,  the  subject  would  have  to  perform  a  computation  that  was  equivalent  to  the  KDE 
computation  except  in  that  it  is  performed  by  some  other  perceptual/cognitive  process,  a  process 
that  did  not  yield  perceptual  depth.  We  call  such  a  computation  a  KDE-alternative  computation. 

Suppose  that  a  subject  chose  to  perform  the  shape  identification  task  by  measuring 
instantaneous  velocities  at  only  a  small  number  of  spatial  positions,  and  making  this  velocity 
determination  at  only  a  single  moment  during  the  motion  sequence,  for  example,  a  moment  at 
which  velocities  were  the  greatest.  A  high  velocity  indicates  a  point  far  forward  or  far  behind  the 
base  plane.  Opposite  velocities  indicate  points  at  opposite  depths.  Using  these  simple  principles, 
it is obvious that velocity measurements at six positions (the corners of both triangles used in
specifying the shapes) would be sufficient to identify the shapes. Indeed, looking for velocity
maxima  (which  indicate  depth  extrema)  is  a  quite  general  principle.  Fewer  measurements  of 
velocity  made  at  intermediate  points  would  suffice  for  identification  of  our  restricted  set  of 
stimuli  but  they  would  involve  unrealistically  complicated  computations  that  were  specific  to  this 
stimulus  set. 
In Experiment 3, we evaluate a computation for shape reconstruction based on a strategy of
making  six  simultaneous  local  velocity  measurements  at  the  points  that  correspond  to  the 
possible  depth  extrema  in  our  stimulus  set. 

Method 

Choosing  motion  trajectories  for  display.  In  the  shape  identification  task  (Fig.  1), 
suppose  one  were  to  track  a  single  point  on  the  surface  of  the  shape  throughout  the  course  of  the 
display. Initially the point is at position (x, y, z), where x and y are the horizontal and vertical
image plane axes, respectively, and z is the depth axis. As in Experiments 1 and 2, assume that
the shape is rotated about the y axis according to $\theta(m) = \pm 25 \sin(2\pi m/30)$, where m is the frame
number. Under parallel projection, the motion path of the point is purely horizontal:

$x(m) = r \cos\left[\frac{\pi}{180}\left(\theta_0 \pm 25 \sin(2\pi m/30)\right)\right]$, where $r = \sqrt{x^2 + z^2}$ and $\theta_0 = \arctan(z/x)$ deg.

If  a  subject  were  to  apply  the  local  motion  strategy  to  the  shape  identification  task,  he  would  need 
to  measure  and  categorize  local  velocity  for  six  such  motion  paths  simultaneously.  In 
Experiment  3,  the  subjects  were  presented  directly  with  stimuli  containing  six  moving  patches 
and they were requested to categorize the local directions of motion.
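
The horizontal motion path of a single tracked point can be computed with a sketch like the one below. It assumes a 25 deg rotation amplitude and one full sinusoidal cycle over the 30-frame display; the function name, the sign argument, and the example coordinates are illustrative. `atan2` is used in place of arctan(z/x) only so that points with x = 0 are handled.

```python
import math


def x_path(x0, z0, sign=+1, amplitude_deg=25.0, n_frames=30):
    """Projected horizontal position x(m) for frames m = 0 .. n_frames.

    The point starts at (x0, z0) in the x-z plane and rotates about the
    vertical (y) axis; under parallel projection only x changes.
    """
    r = math.hypot(x0, z0)                        # distance from the axis
    theta0 = math.degrees(math.atan2(z0, x0))     # initial angle in deg
    path = []
    for m in range(n_frames + 1):
        theta = sign * amplitude_deg * math.sin(2 * math.pi * m / n_frames)
        path.append(r * math.cos(math.radians(theta0 + theta)))
    return path


flat = x_path(0.2, 0.0)     # a point on the base plane
raised = x_path(0.2, 0.5)   # a point with a large initial z
excursion_flat = max(flat) - min(flat)
excursion_raised = max(raised) - min(raised)
print(round(excursion_flat, 3), round(excursion_raised, 3))
```

Under these assumptions the raised point sweeps a much larger horizontal range than the base-plane point over the same frames, which is why speed is informative about depth.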

Displays.  Each  display  is  based  upon  a  particular  shape  from  the  shape  identification  task. 
Each  of  the  six  motion  paths  portrayed  in  the  display  is  based  upon  a  motion  path  followed  by  a 
critical  point  on  the  surface  of  the  shape,  as  just  described.  The  six  critical  points  were  the 
projections onto the surface of the six points originally used to generate the shapes (see Fig. 1a,
‘u’ and ‘d’). The motion paths were based on the shapes with the largest heights (h = 0.5s),
where  s  is  the  width  of  the  visible  background  plane. 

The  displays  were  intended  to  force  subjects  to  use  the  strategy  of  simultaneously 
measuring  six  velocities,  without  any  possibility  of  recourse  to  using  perceived  3D  shape.  Each 
display  consisted  of  6  patches  of  moving  random  dots  (Fig.  5).  The  dots  within  a  patch  all  moved 
with  the  same  velocity,  and  patches  were  spatially  separated,  so  that  there  was  no  perception  of 
depth.  The  outline  squares  of  Fig.  5  were  not  directly  visible  to  the  subject.  They  acted  as 
windows  through  which  planes  of  moving  random  dots  were  seen.  Due  to  a  setup  error,  dot 
density in Expt. 3 was slightly lower than the density used in the constant density
condition of Expt. 2 (0.83 of it rather than equal to it). (This density difference is so small that it went unnoticed at the
time.) 


Insert  Fig.  5  here. 


Response  mapping.  There  were  two  rows  of  three  patches  of  moving  dots.  Figure  5 
indicates  the  correspondence  of  patch  position  to  where  that  patch’s  motion  is  visible  in  the 
original  shape  displays.  Spatial  positions  in  Expts.  2  and  3  are  essentially  similar  except  that  the 
middle positions in each row of Expt. 2 displays were interchanged to create the Expt. 3 displays.
This  was  done  in  order  to  make  the  response  easier  for  the  subjects.  With  the  KDE  shape 
displays, the subject decides whether the three important points are those of the ‘u’ or ‘d’ triangle,
and then categorizes the height at each of the three corners of that triangle. In the corresponding
motion task, the subject decides whether the top or bottom row of patches is most important, and
then categorizes the motion path of each patch in that row.

For  points  at  a  reasonable  height  above  the  base  plane,  the  2D  motion  path  is  quasi- 
sinusoidal.  That  is,  points  move  to  the  left,  then  to  the  right,  then  return  leftward  to  their  starting 
position  (or  right,  then  left,  then  right).  Points  with  a  larger  initial  z  value  move  faster.  The 
extreme  z  values  generate  the  highest  speeds  and  these  always  lie  above  the  vertices  of  the  base 
triangle  used  to  generate  the  shape.  This  means  that  subjects  can  solve  the  motion  task  by  first 
judging  which  row  contains  the  fastest  speed,  and  then,  for  that  row,  categorizing  the  motion  in 
each  of  the  three  patches  about  halfway  through  the  course  of  the  display  time.  Each  patch  was  to 
be labeled as moving quickly to the left (‘l’), quickly to the right (‘r’), or slowly, if at all (‘0’).
Note  that  points  in  the  other  row  also  move  in  a  quasi-sinusoidal  manner,  but  more  slowly  than 
the  maximum  speed  in  the  relevant  row. 
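
The two-step decision rule described above (pick the row containing the fastest patch, then label each patch in that row) can be written out as a small sketch. It assumes that a signed mid-display velocity is available for each patch, with negative values meaning leftward motion; the `slow_threshold` parameter and the example velocities are hypothetical, since the text does not specify a numerical criterion.

```python
def classify(upper, lower, slow_threshold):
    """Map six signed mid-display velocities to a response such as 'ulr0'.

    upper, lower: three velocities each (left, middle, right patch).
    slow_threshold: speeds below this are labeled '0' (assumed criterion).
    """
    # Step 1: choose the row that contains the single fastest patch.
    if max(abs(v) for v in upper) >= max(abs(v) for v in lower):
        row, speeds = "u", upper
    else:
        row, speeds = "d", lower

    # Step 2: label each patch in that row as 'l', 'r', or '0'.
    def label(v):
        if abs(v) < slow_threshold:
            return "0"
        return "l" if v < 0 else "r"

    return row + "".join(label(v) for v in speeds)


response = classify([-3.0, 2.8, 0.2], [-1.0, 0.9, 0.1], slow_threshold=1.5)
print(response)
```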

One possible response is, for example, ulr0. This indicates that the fastest speeds are in the
upper row, the upper-left patch moved ‘right-then-left-then-right’, the upper-middle patch moved
‘left-then-right-then-left’, and the upper-right patch was nearly stationary. There are 54 possible
responses (2 rows, 3 possible motion categories for each of three patches in that row). u000 and
d000 denote the same display, one in which all patches are nearly stationary, resulting in 53
distinct  display  types,  corresponding  to  the  53  distinct  shape-and-rotation  display  types  in  the 
shape  identification  experiment. 
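
The count of responses and distinct display types can be verified by enumeration. The sketch below assumes only the response code described in the text (a row letter followed by three motion labels); the label "flat" is an illustrative name for the single display shared by u000 and d000.

```python
from itertools import product

# Every response: a row letter followed by one label per patch in that row.
responses = [row + "".join(labels)
             for row in "ud"
             for labels in product("lr0", repeat=3)]

# u000 and d000 describe the same (nearly stationary) display.
display_types = {("flat" if r[1:] == "000" else r) for r in responses}

print(len(responses), len(display_types))
```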

There are 53 possible shapes. With two exemplars of the flat shape, and one for all
other shapes, this yielded 54 displays. Motion displays were displayed as bright white dots on a
grey background. The display background luminance was 15.6 cd/m². Each dot added an
additional 24.3 μcd, viewed from a distance of 1.6 m. All other display characteristics were the
same  as  in  Expt.  1. 

Apparatus. The apparatus was the same as in Expt. 1, except that a monochrome U.S.
Pixel  PX15H315LHS  monitor  with  a  fast,  white  phosphor  was  used. 

Viewing Conditions. Stimuli were viewed monocularly with goggles; a circular aperture
restricted the field of view. Luminance outside the aperture was approximately equal to the
background luminance on the CRT, which was 15.6 cd/m². Stimuli were viewed from a distance
of  1.6  m.  After  each  stimulus  presentation,  the  subject  keyed  responses  using  response  buttons, 
and  visual  feedback  was  given  on  the  CRT.  The  room  was  dark,  but  light  adaptation  level  was 
controlled  by  the  CRT  background  and  the  illumination  of  the  occluding  screen. 

Procedure.  A  block  of  trials  consisted  of  108  trials.  Each  of  the  54  displays  was  viewed 
twice  in  random  order.  For  the  stimuli  based  on  the  flat  shape,  two  possible  answers  were  correct 
(u000 and d000). For all other stimuli only one answer was correct.

Subjects  were  told  precisely  the  correct  strategy  to  employ.  They  were  told  that  they  would 
see  six  patches  of  moving  dots.  They  were  to  determine  which  row  contained  the  patch  with  the 
fastest motion (either the upper row, designated ‘u’, or the lower row, designated ‘d’). For that
row,  subjects  were  to  categorize  the  motion  in  each  of  the  three  patches  in  that  row  as  measured 
about  halfway  through  the  course  of  the  display  time.  Each  patch  was  to  be  labeled  as  moving 
quickly to the left (‘l’), quickly to the right (‘r’), or slowly if at all (‘0’). After each response, the
correct answers were displayed as feedback. Other details of the procedure were identical to
Expt. 1.

Subjects.  Two  subjects  were  used  in  the  study.  One  was  an  author,  one  was  a  graduate 
student  naive  to  the  purposes  of  the  experiment.  Both  had  corrected-to-normal  vision.  Subject 
MSL  ran  a  single  block  of  108  trials.  Subject  JBL  ran  three  blocks  of  108  trials. 

Results 

Subject  MSL  scored  90.7%  correct  on  a  single  block  of  108  trials.  In  the  three  blocks  of 
trials  run  by  subject  JBL,  the  scores  were  58.3%,  75.9%,  and  88.0%,  respectively.  Indeed,  after  a 
little practice, performance was quite good, equal to or slightly better than performance in the
easiest  conditions  of  Expts.  1  and  2,  which  had  a  comparable  dot  density  and  range  of  velocities. 

There  were  too  few  trials  to  make  an  in-depth  analysis  of  error  data.  However,  the  most 
frequent  motion  response  errors  corresponded  to  the  two  most  frequent  KDE  errors  in  Table  1 
(small  distortions  or  mislocalizations  of  large  bumps).  For  example,  8  out  of  the  10  errors  made 
by MSL were the analogues of these two error types. Examples: ‘ulll’, a triple ‘up’ bump, was
reported as ‘dlll’, a triple ‘down’ bump; ‘ul0l’ was reported as ‘d0l0’ (a double bump was
mistaken for a single bump in the same location; see Fig. 1). Indeed, these results are not
surprising  since  the  velocities  involved  in  Expts.  1  and  3  are  similar.  It  seems  likely  that  a  very 
large  number  of  trials  would  be  required  to  find  any  significant  differences  in  the  error  patterns  in 
Expt. 3 and those of Expt. 1.

Discussion 

We  have  introduced  a  new  objective  task  for  measuring  the  perceptual  effectiveness  of  the 
kinetic  depth  effect:  shape  identification.  With  the  current  lexicon  of  shapes,  it  measures 
whether  the  subject  can  globally  determine  precisely  which  areas  are  in  front  of  the  ground  and 
which  are  behind.  We  consider  here  some  possible  objections  to  and  some  issues  raised  by  our 
results. 

Cues  to  Structure  from  Motion 

Optic  flow  or  interpoint  distances?  In  the  displays  of  Experiment  2,  in  which  dot  density 
was  controlled,  subjects  solved  the  shape  identification  task  even  though  no  single  frame 
contained  any  information  that  could  have  been  used  to  infer  shape.  For  these  stimuli,  at  least 
two  frames  are  needed  to  infer  shape.  By  definition  then,  the  only  possible  cues  are  motion  cues. 

There  are  at  least  two  possible  motion  cues  to  depth:  optic  flow  and  changing  interpoint 
distances  in  our  displays.  That  is,  subjects  could  be  deriving  shape  from  a  global  optic  flow  field 
(instantaneous  velocity  vector  measurements  across  the  field)  or  from  measurement  of  interpoint 
distances  of  particular  dots  over  two  or  more  frames.  Models  of  the  KDE  have  been  based  on 
both optic flow (Koenderink & van Doorn, 1986) and on interpoint distances (Ullman, 1984;
Hildreth & Grzywacz, 1986; Landy, 1987). To a certain extent, it is possible to differentiate
between  these  models  by  creating  displays  in  which  dots  have  lifetimes  of  only  two  frames.  In 
such  displays,  a  global  optic  flow  field  is  available  (although  noisy),  and  3D  structure  could,  in 
principle, be computed from the flow field. Alternatively, some subset of the points could have
been  used  to  compute  a  3D  object  based  on  interpoint  distances.  However,  the  particular  object 
changes  rapidly  since  within  2  frames  all  points  have  been  replaced  by  entirely  new  points, 
uncorrelated  with  those  of  the  preceding  frames.  It  turns  out  that  subjects  are  quite  adept  at  the 
shape  identification  task  with  such  displays  (Dosher,  Landy,  and  Sperling,  1988;  Landy,  Sperling, 
Dosher,  and  Perkins,  1987).  This,  and  related  results,  are  taken  as  strong  evidence  against  the 
interpoint  distance  models  (Dosher  et  al.,  1988;  Landy  et  al.,  1987).  Together  with  the  results  of 
the  present  experiment,  in  which  changing  density  was  eliminated  as  an  alternative,  this  leaves 
motion flow fields as the necessary and sufficient cue for KDE in moving-dot displays. Whether
interpoint distances or other motion cues are ever perceptually salient remains an open question.
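
The two-frame lifetime manipulation can be illustrated with a small simulation. The following is a minimal sketch, not the display code used in the cited experiments; the function names, the staggered initial ages, the unit-square field, and the constant horizontal shift `dx` (a stand-in for the simulated flow) are all assumptions made for illustration.

```python
import random

def limited_lifetime_frames(n_dots, n_frames, lifetime=2, dx=0.01, seed=0):
    """Simulate a dot field in which no dot survives more than `lifetime` frames.

    Hypothetical scheme, for illustration only: a surviving dot is shifted
    horizontally by `dx` per frame, and a dot that reaches its maximum
    lifetime is replaced by a new dot with a new identity at an uncorrelated
    random position.
    """
    rng = random.Random(seed)
    # Stagger the initial ages so that replacements are spread over frames.
    dots = [{"id": i, "x": rng.random(), "y": rng.random(), "age": i % lifetime}
            for i in range(n_dots)]
    next_id = n_dots
    frames = []
    for _ in range(n_frames):
        frames.append([(d["id"], d["x"], d["y"]) for d in dots])
        for d in dots:
            d["age"] += 1
            if d["age"] >= lifetime:
                d.update(id=next_id, x=rng.random(), y=rng.random(), age=0)
                next_id += 1
            else:
                d["x"] += dx
    return frames

def shared_ids(frame_a, frame_b):
    """Identities of dots present in both frames."""
    return {i for i, _, _ in frame_a} & {i for i, _, _ in frame_b}

frames = limited_lifetime_frames(n_dots=20, n_frames=6)
print(len(shared_ids(frames[0], frames[1])))  # dots that carry motion across one frame pair
print(len(shared_ids(frames[0], frames[2])))  # dots that span three frames
```

In such a display, successive frame pairs still carry local velocity samples, but no set of dots persists long enough to track interpoint distances over three or more frames.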

Multiple  facets  of  the  KDE.  We  have  previously  argued  (Landy,  Dosher,  and  Sperling, 
1985;  Dosher,  Landy  and  Sperling,  1987)  that  measurement  of  the  full  effect  of  stimulus 
manipulations  on  the  KDE  requires  several  subject  responses  in  order  fully  to  describe  the 
richness  of  the  percept.  These  responses  included  judgments  of  coherence  (whether  the  multi-dot 
stimulus  coheres  as  a  single  object),  rigidity  (does  the  object  stretch?),  and  depth  extent  (what  is 
the  amount  of  depth  perceived).  These  different  aspects  of  the  percept  are  partially  correlated, 
but  they  can  be  decoupled  by  suitable  display  manipulations.  For  example,  with  some  subjects, 
the  addition  of  exaggerated  polar  perspective  to  a  display  increases  the  perceived  depth  extent 
even  as  it  decreases  perceived  rigidity. 

In  the  current  experiments,  this  richness  of  the  KDE  percept  was  not  explored.  We 
measured  the  extent  to  which  the  display  was  effective  in  creating  a  global  sensation  of  depth, 
and hence supported objective shape identification. Other aspects such as depth extent or rigidity
were not measured. The difference between the three depth conditions was immediately obvious
to  subjects,  and  increasing  the  depth  extent  displayed  (within  certain  limits)  did  improve 
performance,  but  we  did  not  measure  perceived  depth  extent. 

Although  perceived  rigidity  was  not  explicitly  measured,  non-rigid  percepts  were 
spontaneously reported by subjects. One particular example was very common. Shapes with
both bumps and concavities (e.g., u++-) were occasionally seen in a non-rigid mode. Rather than
seeing  one  area  forward,  another  back,  and  the  whole  thing  rigidly  rotating,  observers  perceived 
both  areas  as  being  in  front  of  the  object  ground,  rotating  in  opposite  directions  (this  percept 
looks  rather  like  a  mitten  with  the  thumb  and  finger  portions  alternately  grasping  and  opening). 
This  particular  non-rigid  percept  occurred  most  often  when  the  number  of  dots  was  large  and  the 
depth extent was at its largest. In this stimulus condition, with mixed-sign shapes, it is clearly
visible  that  the  two  bumps  cross  (in  the  rigid  mode,  one  sees  through  the  bump  to  the  concavity 
behind  it  when  they  cross).  This  is  an  example  of  a  failure  of  the  ‘rigidity  hypothesis’  (Ullman, 
1979;  Schwartz  &  Sperling,  1983;  Braunstein  &  Andersen,  1984;  Adelson,  1985;  Dosher, 
Sperling, & Wurst, 1986), since a stimulus that has a perfect rigid interpretation is perceived as
non-rigid.  (It  should  be  noted  that  the  nonrigid  interpretation  also  is  a  veridical  3D  interpretation 
that  is  consistent  with  the  2D  stimulus;  it  happens  not  to  match  the  required  response  mapping.) 
These  stimuli  are  multi-stable,  yielding  more  than  two  possible  stable  percepts.  When  subjects 
perceived  a  non-rigid  object,  they  were  required  to  compute  the  name  of  one  of  the  possible  rigid 
objects  that  was  consistent  with  what  they  perceived. 

Relations  to  previous  empirical  studies.  We  found  that  shape  identification  performance 
increases with the number of dots displayed and the extent of depth portrayed. Neither of these
results is surprising. The numerosity result is an extension of previous, more subjective, measures
of  the  depth  perceived  in  simple  KDE  displays  (Green,  1961;  Braunstein,  1962).  Increasing  the 
number  of  dots  provides  the  observer  with  more  samples  of  the  motion  of  the  shape  portrayed. 
Increasing  depth  extent  increases  the  range  of  velocities  used.  Both  manipulations  increase  the 
observer’s  signal-to-noise  ratio  in  the  task,  where  noise  sources  may  be  both  external  (such  as 
position  quantization  in  the  display  and  sparse  shape  sampling)  and  internal. 
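
The sampling argument can be made concrete under a simplifying assumption that the text does not itself make: that each dot contributes an independent velocity sample with equal noise variance. The standard error of an averaged velocity estimate then falls as the square root of the number of samples, so each fourfold step in numerosity (20, 80, 320) would halve the noise. The function name and the unit noise level below are illustrative.

```python
import math

def standard_error(sigma, n_samples):
    """Standard error of a mean of n independent samples with noise sigma (assumed model)."""
    return sigma / math.sqrt(n_samples)

for n in (20, 80, 320):
    print(n, standard_error(1.0, n))
```

This is only a bound on what extra dots could contribute; not every dot falls on a region of the shape that is informative for identification.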

What  is  Computed  in  KDE? 

Within  measurement  error,  subjects  performed  equally  well  in  the  motion  judgment  task  of 
Experiment  3  and  comparable  KDE  tasks  of  Experiments  1  and  2.  Further,  the  most  common 
confusion  error  is  the  same  in  both  experiments.  And  there  is  every  reason  to  suppose  that,  if 
more  data  were  available,  the  less  common  errors  also  would  be  highly  correlated.  In  brief,  we 
have  succeeded  in  creating  two  equivalent  tasks  for  classifying  stimuli  into  53  shape  categories: 
one  which  is  solved  by  a  KDE  mechanism  that  yields  a  perceived  3D  shape,  one  which  is  solved 
by  a  motion  perception  mechanism  that  yields  a  perceived  pattern  of  2D  motions.  What  does  this 
imply  about  the  mechanism  of  KDE  and  about  the  technology  of  KDE  experimentation? 

Although  the  specific  nature  of  the  perceptual  algorithm  that  extracts  3D  structure  from  2D 
motion  has  not  yet  been  established,  it  is  reasonable  to  expect  that  it  ultimately  will  be.  Whatever 
the  computation,  the  equivalent  computation  could,  in  principle,  be  carried  out  by  some  other 
system  that  was  supplied  with  the  same  raw  information,  in  this  instance,  the  optical  flow  fields. 
In  Experiment  3,  we  demonstrated  that  the  measurements  of  the  optic  flow  fields  at  six  points 
provide sufficient information for the shape categorization task. When the optic flow at these
locations is provided to observers in a response-compatible format, they can use this optic flow
information  to  categorize  the  stimuli  in  perceived  2D  just  as  efficiently  as  when  they  categorize 
KDE  stimuli  in  perceived  3D.  What  is  special  about  extracting  structure  from  motion  is  not  the 
informational  capacity  of  the  KDE  system,  but  the  perceptual  capacity  for  extracting  the  relevant 
information  and  providing  it  perceptually  as  3D  depth. 

For  extracting  structure  from  motion,  the  relevant  information  is  optic  flow.  This  was 
demonstrated  in  Experiment  2  (where  the  residual  nonflow  cues  were  eliminated)  and  by 
experiments  in  which  dots  were  given  maximum  lifetimes  of  only  two  (or  three)  frames  so  that 
correspondence cues were weakened and only optic flow cues survived (Dosher et al., 1988;
Landy  et  al.,  1987).  The  relevant  information  in  our  particular  shape  discrimination  task  is  the  set 
of  local  velocity  minima  and  maxima  in  the  optic  flow.  A  reasonable  assumption  about  the 
structure-from-motion  computation  is  that  the  perceptual  system  automatically  locates  these 
maxima  and  minima,  extracts  the  velocities,  and  transforms  them  into  perceived  depths. 
[Relative  velocity  has  long  been  recognized  as  an  extremely  potent  depth  cue  (e.g.,  Helmholtz, 
1924,  p  295ff;  Rogers  &  Graham,  1979)  and  undoubtedly  is  a  critical  component  of  KDE.]  When 
the  relevant  areas  of  optical  flow  are  extracted  instead  by  our  display  processor  and  presented  to 
the  subject  as  isolated  patches,  the  subject  is  still  able  to  classify  the  velocity  in  the  patches  but 
the  automatic  perceptual  conversion  of  velocity  into  perceived  depth  is  inhibited.  Nevertheless, 
the extracted velocity information is sufficient to enable accurate classification of the stimuli
when a response-compatible format is made available.


Insert  Fig.  6  here. 


Figure  6  illustrates  the  processes  that  are  assumed  to  be  involved  in  object  recognition  via 
the  KDE.  From  the  stimulus,  the  subject  extracts  a  2D  velocity  flow  field.  The  KDE  is  the 
process  whereby  3D  depth  values  are  extracted  from  the  flow  field.  These  depth  values  are 
combined  with  other  shape  and  contour  information  from  the  stimulus  to  yield  a  3D  object 
percept which then forms the basis for the subject’s response. A KDE-alternative computation is
one that uses the same stimulus and velocity flow field, but circumvents the KDE computation by
deriving the required response directly from the flow field. Experiment 3 demonstrated that a
KDE-alternative computation would be possible in principle if the subject could extract the
velocities  at  the  six  most  relevant  locations. 
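
A computation of this kind can be sketched in a few lines. The sketch below is not the authors' model. It assumes parallel projection and rotation about a vertical axis through the ground, so that near the face-forward position the horizontal image velocity of a point is approximately its depth multiplied by the angular velocity. The function name, the threshold, and the example velocities are hypothetical.

```python
def label_depths(velocities, angular_velocity, threshold=0.05):
    """Convert image velocities at probe locations into depth signs.

    Assumed relation (parallel projection, object near face-forward):
        velocity ~= depth * angular_velocity
    so depth is recovered, up to the unknown sign of rotation, by division.
    """
    labels = []
    for v in velocities:
        depth = v / angular_velocity
        if depth > threshold:
            labels.append("+")
        elif depth < -threshold:
            labels.append("-")
        else:
            labels.append("0")
    return "".join(labels)

# Hypothetical velocities at three probe locations (arbitrary units).
probe_velocities = [0.30, 0.00, -0.28]
print(label_depths(probe_velocities, angular_velocity=1.0))   # +0-
print(label_depths(probe_velocities, angular_velocity=-1.0))  # -0+
```

The second call shows that the same velocities yield the depth-reversed labeling when the assumed direction of rotation is reversed.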

In  transforming  flow-field  velocity  into  perceived  depth,  there  is  an  inherent  ambiguity  in 
sign:  a  given  velocity  can  equally  well  indicate  depth  toward  or  away  from  the  observer.  This 
ambiguity  is  inherent  in  the  optics  of  the  display  and  reflected  in  our  scoring  procedure. 
However,  the  perceptual  system  tends  to  resolve  the  ambiguity  consistently  in  nearby  locations. 
On  those  occasions  where  it  does  not,  e.g.,  when  it  interprets  leftward  motion  as  closer  in  one 
display  area  and  as  further  in  another,  the  display  appears  to  be  grossly  nonrigid.  The  likelihood 
of consistent depth interpretation has been studied by Gillam (1972, 1976) and probably can be
modeled by locally connected cooperative-competition networks (see Sperling, 1981, for an
overview of cooperation-competition in binocular vision, and Williams and Phillips, 1987, for an
example of cooperation in motion perception).
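
The sign ambiguity can be checked numerically. The sketch below assumes parallel (orthographic) projection and rotation about a vertical axis; under those assumptions, reversing both the depth of every point and the direction of rotation leaves every frame of the image unchanged. The function name and the sample points and angles are illustrative.

```python
import math

def image_x(x, z, theta):
    """Horizontal image position of point (x, z) after rotation by theta
    about the vertical axis, under parallel projection (assumed)."""
    return x * math.cos(theta) + z * math.sin(theta)

points = [(-0.4, 0.5), (0.0, 0.15), (0.3, -0.05)]   # (x, depth), hypothetical
angles = [0.0, 0.1, 0.2, 0.1, 0.0, -0.1]            # rotation per frame, radians

original = [[image_x(x, z, t) for x, z in points] for t in angles]
reversed_ = [[image_x(x, -z, -t) for x, z in points] for t in angles]

same = all(abs(a - b) < 1e-12
           for fa, fb in zip(original, reversed_)
           for a, b in zip(fa, fb))
print(same)  # True
```

Vertical image position is unaffected by rotation about a vertical axis, so it is omitted from the check.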

KDE-alternative  computations. 

It is useful to distinguish three kinds of computations: KDE, KDE-alternatives, and
artifactual non-KDE computations. The KDE computation is an automatic perceptual
computation made, in the case of our stimuli, on velocity flow-fields, and it results in perceived
depth (a 3D percept) at those visual field locations where it is successful. A KDE-alternative
computation is a computation on velocity flow-fields similar to the KDE computation except that
it is made consciously in some other part of the brain. It results in a knowledge of the correct
response, but it does not yield perceived depth; the field is perceived as flat. An artifactual,
non-KDE computation uses an incidental property of the display to compute the correct response, and
the  computation  may  be  quite  unrelated  to  the  KDE  computation.  For  example,  the  various 
objective  studies  of  KDE  we  considered  in  the  Introduction  all  were  vulnerable  to  computations 
that  used  only  a  small  portion  -  in  some  instances  only  the  movement  of  a  single  dot  -  of  the 
stimulus  information  that  would  have  been  required  by  a  KDE  computation. 

Of  the  five  studies  reviewed  in  the  Introduction,  the  possible  artifactual  computations 
involved  1  dot  (one  study),  2  dots  (two  studies),  and  other  cues  (two  studies).  The  problem  is 
purely  technical;  the  possible  artifactual  computations  are  quite  different  from  KDE 
computations.  There  is  a  great  risk  of  admitting  an  artifactual  computation  when  the  set  of 
possible  stimuli  is  small  and  when  the  required  KDE  computation  itself  is  relatively  simple. 
Even though subjects in these studies may have perceived KDE depth, a simple 2D strategy
would have improved response accuracy. While some of these procedures could have been
improved,  we  deemed  it  better,  from  the  outset,  to  use  a  large  set  of  stimuli  that  can  be  identified 
only  after  a  relatively  elaborate  KDE  computation.  What  distinguishes  the  present  task  from 
prior  tasks  is  that  they  admitted  artifactual  computations  that  were  short-cuts  to  the  correct 
response;  the  present  alternative  computation  is  an  equivalent  computation  to  KDE. 

With  respect  to  KDE-equivalent  computations,  we  can  ask  two  questions:  do  they  ever 
occur,  and  if  they  do,  how  can  we  be  sure  that  they  do  not  always  occur.  To  demonstrate  that  a 
KDE-equivalent  computation  can  occur  we  first  have  to  know  what  the  KDE  computation  itself 
is,  and  then  to  perturb  the  stimulus  so  that  the  automatic  KDE  computation  cannot  occur.  In  our 
experiment  (and  probably  more  generally),  the  essential  KDE  computation  is  the  discovery  of 
local  velocity  minima  and  maxima,  and  the  consistent  depth-labeling  of  these  minima  and 
maxima.  In  Experiment  3,  the  stimulus  areas  that  contained  velocity  extrema  were  extracted 
from  the  KDE  stimulus  and  (in  order  to  avoid  the  automatic  KDE  computation)  they  were 
presented  as  isolated  squares.  The  subjects  were  able  to  label  these  areas  consistently  with 
respect  to  velocity  (not  depth,  since  the  display  was  perceived  as  flat).  Thus,  subjects  performed 
a  KDE-equivalent  task  by  means  of  a  KDE-equivalent  computation.  Furthermore,  the  pattern  of 
errors  in  the  equivalent  task  corresponded  to  the  previous  error  pattern  in  the  KDE  task.  While 
there  are  necessarily  some  differences  between  the  KDE-stimuli  and  the  alternative  stimuli,  our 
strong result makes it clear that, along with artifactual computations, the possibility of a
KDE-alternative computation has to be considered in interpreting KDE experiments.

Artifactual  computations  are  most  easily  discriminated  from  KDE  computations  by  varying 
stimulus parameters. Stimulus cues that might support an artifactual computation are removed,
masked, or rendered useless by irrelevant variation. If response accuracy survives, we have
increased  confidence  that  it  is  based  on  a  KDE  computation. 

KDE and KDE-alternative computations use the same stimulus attributes; they differ in
where  in  the  brain  the  computation  is  made.  Two  tools  for  discriminating  between  these 
computations  are  introspection  and  dual  tasks.  For  example,  all  subjects,  without  conscious 
effort,  immediately  perceive  our  KDE  stimuli  as  solid  3D  objects.  When  subjects  honestly  report 
that  they  perceive  3D  depth  in  dynamic  KDE  stimuli,  by  definition,  they  have  performed  a  KDE 
computation.  The  problem  is  that  KDE  may  not  be  the  only  computation  being  performed.  For 
complex  stimuli  such  as  ours,  however,  it  is  hard  to  imagine  that  a  subject  could  be  performing  a 
useful  alternative  computation  without  awareness.  Indeed,  the  discovery  of  an  alternative 
computation for KDE is the structure-from-motion problem, and the solution proposed in
Experiment  3  may  be  the  first  workable  solution  for  stimuli  of  this  type.  It  would  be  remarkable 
if  subjects,  even  sophisticated  subjects,  discovered  it  in  the  course  of  viewing  the  stimuli.  Still, 
even  in  this  case,  but  especially  with  simpler  stimuli,  it  would  be  better  to  use  a  formal  procedure 
to  exclude  alternative  computations.  This  requires,  for  example,  (1)  isolating  the  alternative 
computation  (as  in  Experiment  3),  (2)  finding  a  concurrent  task  or  similar  manipulation  that 
selectively  interferes  with  the  alternative  computation  relative  to  the  direct  KDE-computation,  (3) 
using  the  modified  or  dual  tasks  with  the  original  stimuli. 

An  alternative  KDE  computation  is  analogous  to  an  alternative  stereoptic  depth 
computation  that  is  carried  out  by  monocularly  examining  the  left  and  right  members  of  a 
stereogram.  When  stimuli  are  designed  to  take  advantage  of  the  exquisite  sensitivity  of 
stereopsis, an alternative monocular computation that uses remembered disparities is not feasible,
even though it might be learnable in special cases. The same is undoubtedly true for KDE and
alternative KDE computations: for complex KDE stimuli, viewed briefly, the alternative
computation is simply out of the question. However, the problem in interpreting experimental
results has not been alternative KDE computations but artifactual non-KDE computations. The
best way to avoid subsequent problems of interpretation is to use complex stimuli, like the
53-shape stimulus set used here, that are matched to and challenge the ability of the human KDE
computation. 
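
The size of the stimulus set can be reproduced by enumeration. The following sketch assumes, as one plausible reading of the shape designations (a triangle letter, u or d, followed by three depth signs), that the two all-zero designations name the same flat surface and are counted once. The text in this section does not spell out the counting, and the function name is illustrative.

```python
from itertools import product

def build_lexicon():
    """Enumerate shape names: triangle ('u' or 'd') plus three depth signs."""
    shapes = set()
    for triangle in "ud":
        for signs in product("+0-", repeat=3):
            name = triangle + "".join(signs)
            # Assumption: u000 and d000 both describe the flat ground.
            shapes.add("flat" if set(signs) == {"0"} else name)
    return shapes

lexicon = build_lexicon()
print(len(lexicon))  # 53
```

With two rotation directions per shape, the number of distinct shape-and-rotation responses is correspondingly larger, which is what keeps the guessing base rate low.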

Summary  and  Conclusions 

A  new  shape  identification  task  for  measuring  KDE  performance  is  proposed.  With  its 
lexicon  of  53  shapes,  accurate  identification  requires  either  an  accurate  3D  shape  percept  or  a 
KDE-alternative computation based on simultaneous measurements of 2D velocity in six
positions  of  the  display.  Performance  in  the  shape  identification  task  improved  with  increased 
numerosity  in  a  multi-dot  display,  and  with  an  increase  in  the  amount  of  depth  portrayed.  Shape 
identification  was  not  mediated  by  incidental  texture-density  cues  but  rather  by  motion  cues 
derived  from  optic  flow.  The  objective  shape  identification  task  is  proposed  as  a  sensitive 
measure of the critical aspect of kinetic depth performance. It is proposed that the
structure-from-motion algorithm used by subjects to solve the KDE shape identification task involves
finding local 2D velocity minima and maxima and assigning depth values to these locations in
consistent proportion to their velocities.

Figure Legends

Figure  1.  Stimulus  shapes,  rotations,  and  their  designations.  Shapes  were  constructed  by 
smoothly splining a flat ground and three points which were either toward the observer ('+'), in
the flat ground ('0'), or away from the observer ('-'). (a) These three points were at the corners
of one of two possible equilateral triangles, where the odd point is up ('u'), or the odd point is
down ('d'). In the experiment, subjects were required to name the shape and rotation direction
perceived. The numbers specify the order in which the depth signs of the three points are to be
reported. (b) The various combinations result in a lexicon of 53 shapes; typical examples are
illustrated here as perspective plots. The orientation of these plots relative to the viewing
direction is indicated on the first example. (c) Three bump heights were used: 0.5s, 0.15s, and
0.05s, where s is the length of a side of the square base of the shape. The shape depicted here is
u+++. (d) Three dot numerosities were used: 20, 80, and 320. Pictured are the first frames of a
representative display in each numerosity condition. (e) Two rigid rotation motions were
simulated.  Both  were  sinusoidal  rotations  about  a  vertical  axis  through  the  center  of  the  object 
ground.  The  object  either  first  rotated  to  face  the  subject’s  right,  then  to  the  subject’s  left,  then 
returned face-forward (T), or in the opposite direction (V).

Figure  2.  Performance  on  the  shape  identification  task  as  number  of  points  in  the  simulated 
shape  was  varied.  The  parameter  is  the  height  of  the  bumps  relative  to  the  length  of  a  side.  Each 
panel  represents  data  from  a  different  subject.  Performance  increased  with  both  numerosity  and 
bump height.

Figure 3. The dynamic density cue. Three frames are shown from a display corresponding
to u+0+r, a bump in the top center and a second bump on the lower right. The upper row shows
frames  with  the  density  cue.  The  lower  row  illustrates  the  effectiveness  of  removing  the  density 
cue  in  the  motion-only  condition. 

Figure  4.  Percent  of  correct  shape-and-rotation  identifications  for  the  three  cue  conditions  of 
Expt.  2.  Data  are  shown  for  3  subjects. 

Figure  5.  Spatial  layout  of  the  stimuli  used  in  Expt.  3.  The  squares  represent  windows 
through  which  fields  of  moving  random  dots  were  seen.  The  outline  of  the  windows  was  not 
visible  to  the  subject.  The  label  under  each  window  denotes  the  position  in  the  shape  (as  in  Fig. 
1a) which controlled the motion portrayed in that window. For example, the motion path of all
the  random  dots  seen  in  the  upper-middle  window  was  the  same  as  that  taken  by  the  point  in  a 
shape  display  of  Expt.  1  which  was  initially  above  position  7’  in  the  'd'  triangle  shown  in  Fig. 
1a.


Figure 6. Flowchart for KDE, KDE-alternative, and artifactual computations. From the
stimulus,  the  following  are  assumed  to  be  computed  in  sequence:  2D  velocity  flow  field,  3D 
depth  values  (KDE  computation),  a  3D  object  representation  (which  in  this  instance  happens  not 
to  correspond  perfectly  with  the  object  represented  by  the  stimulus),  and  the  required  response 
sequence. The KDE-alternative computation computes the required response sequence directly
from  the  2D  optic  flow  without  an  intermediate  stage  of  perceived  3D  depth;  i.e.,  it  simulates  the 
KDE  computation  in  another  part  of  the  brain.  An  artifactual  computation  uses  incidental 
stimulus cues or motion cues from only a small part of the stimulus to arrive at a response.

Table 1. Summary of Identification Errors, Pooled over
Subjects, Bump Heights, Dot Numerosities, Rotation Directions, and Depth Reversals.

Description                                     Examples                          Number     Number  Ratio*
                                                                                of Cells  of Errors

Small Distortions of Large Bumps                u+++ vs d+++                           2         29    14.5
Incorrect Bump Width, Correct Location          u0++ vs d+00                           4         34     8.5
Missed Smaller Features                         u++- reported as u++0                  6         30     5.0
Diagonal Bump Reported as Large Bump            u++0 reported as u+++ or d+++          8         23     2.9
Missed Equal Size Feature                       u+0- reported as u+00                 12         29     2.4
Incorrect Diagonal Bump Size                    u++- reported as u+0-                  8         16     2.0
Small Horizontal Location Error                 u+00 vs d0+0                          16         27     1.7
Report Two Depth Signs When There Was Only One  u+00 reported as u+-0                168         40    0.24
Other Errors                                                                         478        358    0.75
All Errors                                                                           702        586    0.83

*Total number of indicated error responses divided by total number of
applicable cells (Column 4/Column 3). A ratio greater than 0.83 indicates
a type of error that is more common than average.
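
The ratio column of Table 1 can be recomputed from the other two columns. The sketch below uses the cell and error counts as transcribed from the table; the variable and function names are illustrative.

```python
# (description, number of cells, number of errors), as listed in Table 1
rows = [
    ("Small distortions of large bumps", 2, 29),
    ("Incorrect bump width, correct location", 4, 34),
    ("Missed smaller features", 6, 30),
    ("Diagonal bump reported as large bump", 8, 23),
    ("Missed equal size feature", 12, 29),
    ("Incorrect diagonal bump size", 8, 16),
    ("Small horizontal location error", 16, 27),
    ("Two depth signs reported when there was only one", 168, 40),
    ("Other errors", 478, 358),
]

def error_ratio(cells, errors):
    """Errors per applicable cell (Column 4 / Column 3)."""
    return errors / cells

total_cells = sum(c for _, c, _ in rows)
total_errors = sum(e for _, _, e in rows)
baseline = error_ratio(total_cells, total_errors)

for name, cells, errors in rows:
    ratio = error_ratio(cells, errors)
    flag = "above average" if ratio > baseline else "at or below average"
    print(f"{name}: {ratio:.2f} ({flag})")
print(f"All errors: {total_errors}/{total_cells} = {baseline:.2f}")
```

The row totals reproduce the "All Errors" line (586 errors over 702 cells), which supplies the 0.83 baseline against which each error type is compared.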




